A Gentle Introduction to Conformal Prediction

Anastasios N. Angelopoulos & Stephen Bates

Conformal Prediction (CP): simple but powerful uncertainty quantification

A framework for creating prediction intervals or sets with statistical coverage guarantees.

  • Goal: Ensure that the true value lies within the prediction set.
  • Marginal Coverage: For a user-chosen miscoverage rate \(\alpha\), CP guarantees that the probability that the prediction set contains the correct label is at least \(1 - \alpha\) (and at most \(1 - \alpha + \frac{1}{n+1}\)): \[1 - \alpha \;\leq\; P(Y_{test} \in C(X_{test})) \;\leq\; 1 - \alpha + \frac{1}{n+1}\]

Benefits of Conformal Prediction

  • Model-agnostic (assumes the model is a black-box)
  • Finite-sample guarantees
  • Computationally cheap
  • No need to modify your training procedure
  • Can be performed on any dataset of any size
  • Caveat: assumes you have a way to define nonconformity scores for your data/model

Instructions for Conformal Prediction

  1. Model Training: Train a predictive model on the training dataset.
  2. Nonconformity Measure: Define a measure \(s(x, y) \in \mathbb{R}\) to quantify how unusual a new example is relative to the training data.
  3. Calibration: Use a separate calibration set to compute nonconformity scores.
  4. Quantile Computation: Compute \(\hat{q}\) as the \(\frac{\lceil{(n+1)(1-\alpha)}\rceil}{n}\) empirical quantile of the calibration scores \(\small s_1 = s(X_1, Y_1), ..., s_n = s(X_n, Y_n)\).
  5. Prediction Set Construction: For a new input, include those labels or intervals that meet the calibrated nonconformity threshold. \[ \mathcal{C}(X_{test}) = \{y : s(X_{test}, y) \leq \hat{q}\}\]
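The steps above can be sketched end to end on a toy regression problem. Here the "trained model" is just a stand-in identity predictor and the absolute residual \(|y - f(x)|\) plays the role of the nonconformity score; all names and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(x):
    return x  # stand-in for any pre-trained black-box model

# Calibration data from a toy 1-D regression problem
x_cal = rng.uniform(0, 1, size=1000)
y_cal = x_cal + rng.normal(0, 0.1, size=1000)

# Steps 2-4: nonconformity scores and the calibrated quantile
alpha = 0.1
scores = np.abs(y_cal - predict(x_cal))            # s(x, y) = |y - f(x)|
n = len(scores)
q_level = np.ceil((n + 1) * (1 - alpha)) / n
qhat = np.quantile(scores, q_level, method="higher")

# Step 5: prediction interval for a new input
x_new = 0.5
interval = (predict(x_new) - qhat, predict(x_new) + qhat)
```

On fresh draws from the same distribution, roughly 90% of responses fall inside the resulting intervals.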

Desirable Properties of Conformal Prediction

  • Validity: The prediction sets have the desired coverage.
  • Efficiency: The prediction sets are as small as possible.
  • Adaptivity: The prediction sets are tailored to the difficulty of each instance.

  • Conditional Coverage: \[\mathbb{P}\left[ Y_{\text{test}} \in \mathcal{C}(X_{\text{test}}) \mid X_{\text{test}} \right] \geq 1 - \alpha.\]

Classification with Adaptive Prediction

  • Adaptive Methods: Adjust prediction sets based on instance difficulty.
  • One idea: if the softmax outputs \(\hat{f}(X_{test})\) were a perfect model of \(Y_{test}| X_{test}\), we would greedily include the top-scoring classes until their total mass exceeded \(1 − \alpha\).
  • Define a score: \[s(x, y) = \sum_{j=1}^k \hat{f}(x)_{\pi_j(x)}, \quad \text{where } y = \pi_k(x),\] and \(\pi(x)\) is the permutation that sorts the entries of \(\hat{f}(x)\) in descending order.
  • Form the prediction set: \[\small C(x) = \{\pi_1(x), \pi_2(x), \ldots, \pi_k(x)\}, \quad \text{where } k = \sup \left\{ k' : \sum_{j=1}^{k'} \hat{f}(x)_{\pi_j(x)} < \hat{q} \right\} + 1\]
  • Advantages: Can reduce prediction set sizes without compromising coverage.
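A minimal NumPy sketch of this adaptive score and set construction, assuming `smx` holds softmax outputs with one row per example (the helper names `aps_score` and `aps_sets` are illustrative):

```python
import numpy as np

def aps_score(smx, labels):
    """Cumulative softmax mass down to (and including) the true class."""
    order = np.argsort(-smx, axis=1)                       # pi(x): classes by probability
    sorted_smx = np.take_along_axis(smx, order, axis=1)
    cumsum = np.cumsum(sorted_smx, axis=1)
    ranks = np.argmax(order == labels[:, None], axis=1)    # k such that y = pi_k(x)
    return cumsum[np.arange(len(labels)), ranks]

def aps_sets(smx, qhat):
    """Include top-ranked classes while the mass ranked strictly above is < qhat."""
    order = np.argsort(-smx, axis=1)
    sorted_smx = np.take_along_axis(smx, order, axis=1)
    cumsum = np.cumsum(sorted_smx, axis=1)
    keep_sorted = (cumsum - sorted_smx) < qhat             # mass of higher-ranked classes
    # map the boolean mask back from sorted order to the original class order
    return np.take_along_axis(keep_sorted, np.argsort(order, axis=1), axis=1)

# Example: one test point with softmax outputs over three classes
smx = np.array([[0.2, 0.5, 0.3]])
sets = aps_sets(smx, qhat=0.6)   # includes the top two classes, 1 and 2
```

The inverse permutation `np.argsort(order)` is what lets the mask computed in sorted order be read off per class.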

Conformalized Quantile Regression

  • Process:
    1. Fit Quantile Regression: Estimate the conditional quantiles \(\hat{t}_{\alpha/2}(x)\) and \(\hat{t}_{1-\alpha/2}(x)\).
    2. Compute Nonconformity Scores: \(s(x, y) = \max \left\{ \hat{t}_{\alpha/2}(x) - y, \; y - \hat{t}_{1-\alpha/2}(x) \right\}.\)
    3. Compute Quantile: \(\hat{q} = \text{Quantile}(s_1, s_2, \ldots, s_n; \lceil (n+1)(1-\alpha) \rceil / n)\)
    4. Adjust intervals to achieve the desired coverage: \[C(x) = [\hat{t}_{\alpha/2}(x) - \hat{q}, \;\; \hat{t}_{1-\alpha/2}(x) + \hat{q}]\]
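The four steps can be sketched with scikit-learn's gradient boosting quantile regressor on synthetic heteroskedastic data (the data-generating function and variable names here are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

def sample(n):
    """Synthetic heteroskedastic data: noise level grows with x."""
    x = rng.uniform(0, 5, size=(n, 1))
    y = np.sin(x[:, 0]) + rng.normal(0, 0.1 + 0.1 * x[:, 0])
    return x, y

x_train, y_train = sample(2000)
x_cal, y_cal = sample(1000)
alpha = 0.1

# 1. Fit lower and upper quantile regressions
lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(x_train, y_train)
hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(x_train, y_train)

# 2. Nonconformity scores: how far y falls outside [t_lo(x), t_hi(x)]
scores = np.maximum(lo.predict(x_cal) - y_cal, y_cal - hi.predict(x_cal))

# 3. Calibrated quantile
n = len(scores)
qhat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# 4. Conformalized interval for a new input
x_new = np.array([[2.5]])
interval = (lo.predict(x_new)[0] - qhat, hi.predict(x_new)[0] + qhat)
```

Because the base intervals already widen where the noise is larger, the conformalized intervals stay adaptive while the calibration step restores exact marginal coverage.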

Evaluating Adaptivity

  • Objective: Assess how well adaptive methods improve efficiency.
  • Efficiency: Size or volume of prediction sets.
  • Coverage: Proportion of true labels contained in prediction sets.
  • Size-stratified coverage (SSC) metric:
    • Discretize \(\mathcal{C}(x)\) into \(G\) bins based on the size of the prediction set.
    • Let \(I_g \subset \{1, \ldots, n_{\text{val}}\}\) be the set of observations falling in bin \(g\) for \(g = 1, \ldots, G\).
    • \[\text{SSC Metric:} \quad \min_{g \in \{1, \ldots, G\}} \frac{1}{|I_g|} \sum_{i \in I_g} \mathbf{1}\{Y_i \in C(X_i)\}\] i.e., the worst observed coverage among units whose set size \(|C(x)|\) falls into bin \(g\).
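A possible NumPy implementation of the SSC metric, assuming prediction sets are stored as a boolean membership matrix (the helper name is illustrative):

```python
import numpy as np

def size_stratified_coverage(pred_sets, labels, bins):
    """Worst observed coverage over examples binned by prediction-set size.

    pred_sets: (n, K) boolean membership matrix; labels: (n,) integer labels;
    bins: list of (lo, hi) inclusive size ranges.
    """
    sizes = pred_sets.sum(axis=1)                        # |C(x)| per unit
    covered = pred_sets[np.arange(len(labels)), labels]  # 1{Y_i in C(X_i)}
    coverages = [covered[(sizes >= lo) & (sizes <= hi)].mean()
                 for lo, hi in bins
                 if ((sizes >= lo) & (sizes <= hi)).any()]
    return min(coverages)

# Example: four units, three classes, sizes binned as {1} and {2, 3}
pred_sets = np.array([[1, 0, 0], [1, 1, 0], [1, 1, 1], [0, 1, 0]], dtype=bool)
labels = np.array([0, 0, 2, 2])
ssc = size_stratified_coverage(pred_sets, labels, bins=[(1, 1), (2, 3)])
```

In the example the size-1 bin covers only one of its two units, so the SSC metric reports 0.5 even though overall coverage is 0.75.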

Checking for Correct Coverage

  • Coverage Diagnostic: Empirically estimate the coverage over \(R\) random calibration/validation splits: \[C_j = \frac{1}{n_{\text{val}}} \sum_{i=1}^{n_{\text{val}}} \mathbf{1}\left\{ Y_{i,j}^{(\text{val})} \in C_j\left(X_{i,j}^{(\text{val})}\right) \right\}, \quad \text{for } j = 1, \ldots, R,\]

  • Approximate test: \[\overline{C} = \frac{1}{R} \sum_{j=1}^R C_j \approx 1 - \alpha\]

    If this holds, the coverage averaged over random splits matches the nominal level \(1 - \alpha\). Note this is a check of marginal coverage; it does not by itself guarantee coverage for every individual \(X_{test}\).

  • Adjustments:

    • If coverage is off, adjust the nonconformity scores or recalibrate.
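One way to run this diagnostic is by simulation. The sketch below uses synthetic exponential scores in place of real model scores, repeatedly splits them into calibration and validation halves, and averages the per-split coverages \(C_j\):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1

# Stand-in nonconformity scores for 2,000 exchangeable examples
scores = rng.exponential(size=2000)

R = 100
coverages = []
for _ in range(R):
    perm = rng.permutation(len(scores))
    cal, val = scores[perm[:1000]], scores[perm[1000:]]
    n = len(cal)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    qhat = np.quantile(cal, q_level, method="higher")
    coverages.append(np.mean(val <= qhat))  # C_j on the j-th split

mean_cov = np.mean(coverages)  # should land near 1 - alpha = 0.9
```

With real data the same loop applies, substituting the model's nonconformity scores for the synthetic ones.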

Effect of Calibration Size

  • Calibration Set Size:

    • Influences the variability of the nonconformity scores.
    • Larger sets yield more stable and reliable prediction sets.
  • Key idea: the coverage of conformal prediction conditionally on the calibration set is a random quantity

  • Coverage Distribution: \[P(Y_{test} \in C(X_{test}) \mid \{(X_i, Y_i)\}_{i=1}^n) \sim \text{Beta}(n+1-l, l), \quad l = \lfloor (n+1)\alpha \rfloor\]
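This Beta law can be checked by simulation: with Uniform(0,1) scores, the conditional coverage given the calibration set equals \(\hat{q}\) itself, and \(\hat{q}\) is the \(\lceil (n+1)(1-\alpha) \rceil\)-th order statistic of \(n\) uniforms:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n = 0.1, 100
l = int(np.floor((n + 1) * alpha))       # l = 10
k = int(np.ceil((n + 1) * (1 - alpha)))  # k = n + 1 - l = 91

# One draw of qhat per simulated calibration set of uniform scores
trials = 5000
cal = rng.uniform(size=(trials, n))
qhats = np.sort(cal, axis=1)[:, k - 1]   # k-th smallest score in each trial

beta_mean = (n + 1 - l) / (n + 1)        # mean of Beta(n + 1 - l, l)
```

The simulated mean and variance of the coverage match the Beta(91, 10) moments, illustrating how much coverage fluctuates at this modest calibration size.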

MNIST Classification Example

import os
os.environ["KERAS_BACKEND"] = "torch"
import keras
from keras import layers
from keras.datasets import mnist
from sklearn.model_selection import train_test_split
import numpy as np


# 1. Load and preprocess the MNIST data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, y_train = x_train[:5_000], y_train[:5_000]  # subsample inputs and labels together
x_train = x_train.reshape((x_train.shape[0], 28 * 28)).astype("float32") / 255
x_test = x_test.reshape((x_test.shape[0], 28 * 28)).astype("float32") / 255
x_test, x_cal, y_test, y_cal = train_test_split(
    x_test, y_test, test_size=0.5, random_state=42
)

num_classes = 10
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

# 2. Build a simple model
model = keras.Sequential(
    [
        layers.Input(shape=(784,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ]
)

# 3. Compile the model
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# 4. Train the model
model.fit(x_train, y_train, epochs=1, batch_size=128, validation_split=0.1)

# 5. Evaluate on the test set
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
print("Test accuracy:", test_acc) # Test accuracy: 0.83

Conformal prediction is easy to implement

# 1: get conformal scores
n = y_cal.shape[0]
cal_smx = model(x_cal).detach().cpu().numpy()  # output layer already applies softmax
cal_scores = 1 - cal_smx[np.arange(n), y_cal]

# 2: get adjusted quantile
alpha = 0.1
q_level = np.ceil((n + 1) * (1 - alpha)) / n
qhat = np.quantile(cal_scores, q_level, method="higher")
test_smx = model(x_test).detach().cpu().numpy()

# 3: form prediction sets
prediction_sets = test_smx >= (1 - qhat)

Results

for i in range(5):
    sets = np.where(prediction_sets[i])[0]
    label = np.argmax(y_test[i])
    print(f"Prediction set for image {i}: {sets}, True label: {label}")

Prediction set for image 0: [8], True label: 8

Prediction set for image 1: [4 9], True label: 4

Prediction set for image 2: [3 5], True label: 3

Prediction set for image 3: [1], True label: 1

Prediction set for image 4: [2], True label: 2

Selective Classification

  • Concept: The model can choose to abstain from making a prediction when uncertain.
  • Benefits:
    • Improves overall accuracy on predictions made.
    • Controls the error rate by not predicting on ambiguous instances.
  • Implementation:
    • Set a confidence threshold.
    • Only make predictions when the model’s confidence exceeds this threshold.

Selective Classification

More formally, given image-class pairs \(\{(X_i, Y_i)\}_{i=1}^n\) and an image classifier \(\hat{f}\), we seek:

\[ \mathbb{P}\left( Y_{\text{test}} = \hat{Y}(X_{\text{test}}) \;\middle|\; \hat{P}(X_{\text{test}}) \geq \hat{\lambda} \right) \geq 1 - \alpha, \]

where \(\hat{Y}(x) = \arg\max_y \hat{f}(x)_y\), \(\hat{P}(X_{\text{test}}) = \max_y \hat{f}(X_{\text{test}})_y\), and \(\hat{\lambda}\) is a threshold chosen using the calibration set.
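One simple, illustrative way to choose \(\hat{\lambda}\) on the calibration set is to scan candidate thresholds and keep the smallest one whose retained points are sufficiently accurate; the helper name below is hypothetical, and a rigorous procedure would add a finite-sample correction for scanning multiple thresholds:

```python
import numpy as np

def calibrate_threshold(conf, correct, alpha):
    """Smallest confidence cutoff whose retained calibration points are
    at least (1 - alpha)-accurate. Heuristic sketch only: no correction
    for the multiple candidate thresholds being tested."""
    for lam in np.sort(np.unique(conf)):
        keep = conf >= lam
        if correct[keep].mean() >= 1 - alpha:
            return lam
    return np.inf  # abstain on everything

# Example: max-softmax confidences and whether the top class was correct
conf = np.array([0.55, 0.60, 0.90, 0.95])
correct = np.array([False, True, True, True])
lam_hat = calibrate_threshold(conf, correct, alpha=0.2)  # keeps the top three points
```

At test time the classifier predicts only when \(\hat{P}(X_{\text{test}}) \geq \hat{\lambda}\) and abstains otherwise.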

Review

  • Conformal Prediction: A simple, inexpensive, distribution-free framework for uncertainty quantification.
  • Adaptivity: Tailoring prediction sets to individual instances enhances efficiency.
  • Practical Considerations:
    • Calibration size matters.
    • Verify empirical and conditional coverage to ensure correctness.
  • Extensions: Multilabel and selective classification highlight the flexibility and adaptability of conformal methods.

Concluding Thoughts

  • Should you read it? Yes, but not the whole thing; use it as a reference.
  • Is it easy to implement? Yes. The authors provide code for all of their examples.
  • Many extensions: online setting, conditional coverage, etc.
  • See this repo for many good sources: